3  Reading data

3.1 Types of data - structured and unstructured

Reading data is the first step in extracting information from it. Broadly, data exists in two formats:

  1. Structured data, and
  2. Unstructured data.

Structured data is typically stored in a tabular form, where rows in the data correspond to “observations” and columns correspond to “variables”. For example, the following dataset contains 5 observations, where each observation (or row) consists of information about a movie. The variables (or columns) contain different pieces of information about a given movie. As all variables for a given row are related to the same movie, the data below is also called relational data.

Title US Gross Production Budget Release Date Major Genre Creative Type Rotten Tomatoes Rating IMDB Rating
0 The Shawshank Redemption 28241469 25000000 Sep 23 1994 Drama Fiction 88 9.2
1 Inception 285630280 160000000 Jul 16 2010 Horror/Thriller Fiction 87 9.1
2 One Flew Over the Cuckoo's Nest 108981275 4400000 Nov 19 1975 Comedy Fiction 96 8.9
3 The Dark Knight 533345358 185000000 Jul 18 2008 Action/Adventure Fiction 93 8.9
4 Schindler's List 96067179 25000000 Dec 15 1993 Drama Non-Fiction 97 8.9

Unstructured data is data that is not organized in any pre-defined manner. Examples of unstructured data include text files, audio/video files, images, Internet of Things (IoT) data, etc. Unstructured data is relatively harder to analyze, as most analytical methods and tools are oriented towards structured data. However, unstructured data can be used to obtain structured data, which in turn can be analyzed. For example, an image can be converted to an array of pixels - which is structured data. Machine learning algorithms can then be used on the array to classify the image as that of a dog or a cat.
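
As a minimal sketch of this idea (assuming an image file cat.jpg exists in the working directory, and using the third-party NumPy and Pillow libraries):

#Converting an image (unstructured) to an array of pixels (structured)
import numpy as np
from PIL import Image

img = Image.open('cat.jpg') #Reading the image file
pixels = np.array(img) #Converting the image to an array of pixel intensities
print(pixels.shape) #For an RGB image: (height, width, 3)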

In this course, we will focus on analyzing structured data.

3.2 Reading a csv file with Pandas

Structured data can be stored in a variety of formats. The most popular format is csv (as in data_file_name.csv), where the extension csv stands for comma separated values. In a .csv file, the variable values of each observation are separated by commas; in other words, the delimiter is a comma. However, the commas are not visible when a .csv file is opened with Microsoft Excel, which renders the values in a tabular layout instead.
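
For instance, the first few lines of a hypothetical movies csv file look like this when opened with a plain-text editor (the values here are illustrative):

Title,US Gross,Production Budget
The Shawshank Redemption,28241469,25000000
Inception,285630280,160000000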

3.2.1 Using the read_csv function

We will use functions from the Pandas library of Python to read data. Let us import Pandas to use its functions.

import pandas as pd

Note that pd is the alias that we will use to call Pandas functions. The alias can be any name desired by the user, though pd is the widely used convention.

The function to read a csv file is read_csv(). It reads the dataset into an object of type Pandas DataFrame. Let us read the dataset movie_ratings.csv in Python.

movie_ratings = pd.read_csv('movie_ratings.csv')

The built-in Python function type can be used to check the datatype of an object:

type(movie_ratings)
pandas.core.frame.DataFrame

Note that the file movie_ratings.csv is stored at the same location as the Python script containing the above code. If that is not the case, we’ll need to specify the location of the file, as in the following code.

movie_ratings = pd.read_csv('D:/Books/DataScience_Intro_python/movie_ratings.csv')

Note that a forward slash is used instead of a backslash when specifying the path of the data file. Another option is to use two consecutive backslashes instead of a single forward slash, as sketched below.
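
As a sketch, the following three calls read the same file; the raw-string form (with the r prefix, in which backslashes are not treated as escape characters) is a third option not used elsewhere in this book:

movie_ratings = pd.read_csv('D:/Books/DataScience_Intro_python/movie_ratings.csv') #Forward slashes
movie_ratings = pd.read_csv('D:\\Books\\DataScience_Intro_python\\movie_ratings.csv') #Two consecutive backslashes
movie_ratings = pd.read_csv(r'D:\Books\DataScience_Intro_python\movie_ratings.csv') #Raw string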

3.2.2 Specifying the working directory

In case we need to read several datasets from a given location, it may be inconvenient to specify the path every time. In such a case we can change the current working directory to the location where the datasets are located.

We’ll use the os library of Python to view and/or change the current working directory.

import os #Importing the 'os' library
os.getcwd() #Getting the path to the current working directory
C:\Users\username\STAT303-1\Quarto Book\DataScience_Intro_python

The function getcwd() stands for get current working directory.

Suppose the dataset to be read is located at 'D:\Books\DataScience_Intro_python\Datasets'. Then, we’ll use the function chdir to change the current working directory to this location.

os.chdir('D:/Books/DataScience_Intro_python/Datasets')

Now we can read the dataset from this location without mentioning the entire path as shown below.

movie_ratings = pd.read_csv('movie_ratings.csv')

3.2.3 Data overview

Once the data has been read, we may want to see what it looks like. We’ll use the head() function of the Pandas DataFrame class to view the first few rows of the data.

movie_ratings.head()
Title US Gross Worldwide Gross Production Budget Release Date MPAA Rating Source Major Genre Creative Type IMDB Rating IMDB Votes
0 Opal Dreams 14443 14443 9000000 Nov 22 2006 PG/PG-13 Adapted screenplay Drama Fiction 6.5 468
1 Major Dundee 14873 14873 3800000 Apr 07 1965 PG/PG-13 Adapted screenplay Western/Musical Fiction 6.7 2588
2 The Informers 315000 315000 18000000 Apr 24 2009 R Adapted screenplay Horror/Thriller Fiction 5.2 7595
3 Buffalo Soldiers 353743 353743 15000000 Jul 25 2003 R Adapted screenplay Comedy Fiction 6.9 13510
4 The Last Sin Eater 388390 388390 2200000 Feb 09 2007 PG/PG-13 Adapted screenplay Drama Fiction 5.7 1012
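
By default, head() shows the first 5 rows. An integer argument changes the number of rows shown, and the analogous tail() function shows the last few rows:

movie_ratings.head(3) #First 3 rows of the data
movie_ratings.tail() #Last 5 rows of the data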

3.2.3.1 Row Indices and column names (axis labels)

The bold integers on the left are the indices of the DataFrame. Each index refers to a distinct row. For example, the index 2 corresponds to the row of the movie The Informers. By default, the indices are integers starting from 0. However, they can be changed (even to non-integer values) if desired by the user, as sketched below.
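
For instance, a sketch of replacing the default integer indices with the movie titles (not needed for the rest of this chapter):

#Setting the 'Title' column as the row index
movie_ratings_titled = movie_ratings.set_index('Title')
movie_ratings_titled.head() #The row indices are now the movie titles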

The bold text on top of the DataFrame refers to column names. For example, the column US Gross consists of the gross revenue of a movie in the US.

Collectively, the indices and column names are referred to as axis labels.

3.2.3.2 Shape of DataFrame

For finding the number of rows and columns in the data, you may use the shape attribute. Note that shape is an attribute and not a function, so it is used without parentheses.

#Finding the shape of movie_ratings dataset
movie_ratings.shape
(2228, 11)

The movie_ratings dataset contains 2,228 observations (or rows) and 11 variables (or columns).

3.2.4 Summary statistics

3.2.4.1 Numeric columns summary

The describe() function of the Pandas DataFrame class can be used to conveniently print the summary statistics of the numeric columns of the data.

#Finding summary statistics of movie_ratings dataset
movie_ratings.describe()
Table 3.1: Summary statistics of numeric variables
US Gross Worldwide Gross Production Budget IMDB Rating IMDB Votes Release Year
count 2.228000e+03 2.228000e+03 2.228000e+03 2228.000000 2228.000000 2228.000000
mean 5.076370e+07 1.019370e+08 3.816055e+07 6.239004 33585.154847 2002.005386
std 6.643081e+07 1.648589e+08 3.782604e+07 1.243285 47325.651561 5.524324
min 0.000000e+00 8.840000e+02 2.180000e+02 1.400000 18.000000 1953.000000
25% 9.646188e+06 1.320737e+07 1.200000e+07 5.500000 6659.250000 1999.000000
50% 2.838649e+07 4.266892e+07 2.600000e+07 6.400000 18169.000000 2002.000000
75% 6.453140e+07 1.200000e+08 5.300000e+07 7.100000 40092.750000 2006.000000
max 7.601676e+08 2.767891e+09 3.000000e+08 9.200000 519541.000000 2039.000000


3.2.4.2 Summary statistics across rows/columns

The Pandas DataFrame class has functions such as sum() and mean() to compute statistics such as the sum or mean over the rows or columns of a DataFrame.

Let us compute the mean of all the numeric columns of the data:

movie_ratings.mean(axis = 0)
US Gross             5.076370e+07
Worldwide Gross      1.019370e+08
Production Budget    3.816055e+07
IMDB Rating          6.239004e+00
IMDB Votes           3.358515e+04
dtype: float64

The argument axis=0 denotes that the mean is taken over all the rows of the DataFrame, producing one value per column. For computing a statistic across columns (one value per row), the argument axis=1 is used, as sketched below.
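
As an illustrative sketch (the pair of columns here is chosen only for demonstration), the mean across columns for each movie can be computed as:

#One value per row: mean of the two gross-revenue columns for each movie
movie_ratings[['US Gross','Worldwide Gross']].mean(axis = 1)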

If the mean over a subset of columns is desired, those columns can be selected from the data first. For example, let us compute the mean IMDB rating and the mean IMDB votes over all the movies:

movie_ratings[['IMDB Rating','IMDB Votes']].mean(axis = 0)
IMDB Rating        6.239004
IMDB Votes     33585.154847
dtype: float64

3.2.5 Practice exercise 1

Read the file Top 10 Albums By Year.csv. This file contains the top 10 albums for each year from 1990 to 2021. Each row corresponds to a unique album.

3.2.5.1

Print the first 5 rows of the data.

album_data = pd.read_csv('./Datasets/Top 10 Albums By Year.csv')
album_data.head()
Year Ranking Artist Album Worldwide Sales CDs Tracks Album Length Hours Minutes Seconds Genre
0 1990 8 Phil Collins Serious Hits... Live! 9956520 1 15 1:16:53 1.28 76.88 4613 Rock
1 1990 1 Madonna The Immaculate Collection 30000000 1 17 1:13:32 1.23 73.53 4412 Pop
2 1990 10 The Three Tenors Carreras Domingo Pavarotti In Concert 1990 8533000 1 17 1:07:55 1.13 67.92 4075 Classical
3 1990 4 MC Hammer Please Hammer Don't Hurt Em 18000000 1 13 0:59:04 0.98 59.07 3544 Hip Hop
4 1990 6 Movie Soundtrack Aashiqui 15000000 1 12 0:58:13 0.97 58.22 3493 World

3.2.5.2

How many rows and columns are there in the data?

album_data.shape
(320, 12)

There are 320 rows and 12 columns in the data.

3.2.5.3

Print the summary statistics of the data, and answer the following questions:

  1. What proportion of albums have 15 or fewer tracks? Mention a range for the proportion.
  2. What is the mean length of a track (in minutes)?
album_data.describe()
Year Ranking CDs Tracks Hours Minutes Seconds
count 320.000000 320.00000 320.000000 320.000000 320.000000 320.000000 320.000000
mean 2005.500000 5.50000 1.043750 14.306250 0.941406 56.478500 3388.715625
std 9.247553 2.87678 0.246528 5.868995 0.382895 22.970109 1378.209812
min 1990.000000 1.00000 1.000000 6.000000 0.320000 19.430000 1166.000000
25% 1997.750000 3.00000 1.000000 12.000000 0.740000 44.137500 2648.250000
50% 2005.500000 5.50000 1.000000 13.000000 0.860000 51.555000 3093.500000
75% 2013.250000 8.00000 1.000000 15.000000 1.090000 65.112500 3906.750000
max 2021.000000 10.00000 4.000000 67.000000 5.070000 304.030000 18242.000000

At least 75% of the albums have 15 or fewer tracks, since the 75th percentile of the number of tracks is 15. However, some of the albums between the 75th percentile and the maximum may also have exactly 15 tracks, so the proportion can be higher than 75%. Since the maximum number of tracks is 67, at least one album has more than 15 tracks, and so the proportion is strictly less than 100%. Thus, the proportion of albums having 15 or fewer tracks lies in the range [75%, 100%).

print("Mean length of a track =",56.478500/14.306250, "minutes")
Mean length of a track = 3.9478200087374398 minutes

3.2.6 Creating new columns from existing columns

New variables (or columns) can be created based on existing variables, or with external data (we’ll see adding external data later). For example, let us create a new variable ratio_wgross_by_budget, which is the ratio of Worldwide Gross and Production Budget for each movie:

movie_ratings['ratio_wgross_by_budget'] = movie_ratings['Worldwide Gross']/movie_ratings['Production Budget']

The new variable can be seen at the right end of the updated DataFrame as shown below.

movie_ratings.head()
Title US Gross Worldwide Gross Production Budget Release Date MPAA Rating Source Major Genre Creative Type IMDB Rating IMDB Votes ratio_wgross_by_budget
0 Opal Dreams 14443 14443 9000000 Nov 22 2006 PG/PG-13 Adapted screenplay Drama Fiction 6.5 468 0.001605
1 Major Dundee 14873 14873 3800000 Apr 07 1965 PG/PG-13 Adapted screenplay Western/Musical Fiction 6.7 2588 0.003914
2 The Informers 315000 315000 18000000 Apr 24 2009 R Adapted screenplay Horror/Thriller Fiction 5.2 7595 0.017500
3 Buffalo Soldiers 353743 353743 15000000 Jul 25 2003 R Adapted screenplay Comedy Fiction 6.9 13510 0.023583
4 The Last Sin Eater 388390 388390 2200000 Feb 09 2007 PG/PG-13 Adapted screenplay Drama Fiction 5.7 1012 0.176541

3.2.7 Datatype of variables

Note that in Table 3.1 (summary statistics), we don’t see Release Date. This is because the datatype of Release Date is not numeric.

The datatype of each variable can be seen using the dtypes attribute of the DataFrame class.

#Checking the datatypes of the variables 
movie_ratings.dtypes
Title                 object
US Gross               int64
Worldwide Gross        int64
Production Budget      int64
Release Date          object
MPAA Rating           object
Source                object
Major Genre           object
Creative Type         object
IMDB Rating          float64
IMDB Votes             int64
dtype: object

Often, we wish to convert the datatypes of some of the variables to make them suitable for analysis. For example, the datatype of Release Date in the DataFrame movie_ratings is object. To perform numerical computations on this variable, we’ll need to convert it to a datetime format. We’ll use the Pandas function to_datetime() to convert it to a datetime format. Similar functions, such as to_numeric() and to_timedelta(), can be used for other conversions.

pd.to_datetime(movie_ratings['Release Date'])
0      2006-11-22
1      1965-04-07
2      2009-04-24
3      2003-07-25
4      2007-02-09
          ...    
2223   2004-07-07
2224   1998-06-19
2225   2010-05-14
2226   1991-06-14
2227   1998-01-23
Name: Release Date, Length: 2228, dtype: datetime64[ns]

We can see above that the function to_datetime() converts Release Date to a datetime format.

Now, we’ll update the variable Release Date in the DataFrame to be in the datetime format:

movie_ratings['Release Date'] = pd.to_datetime(movie_ratings['Release Date'])
movie_ratings.dtypes
Title                        object
US Gross                      int64
Worldwide Gross               int64
Production Budget             int64
Release Date         datetime64[ns]
MPAA Rating                  object
Source                       object
Major Genre                  object
Creative Type                object
IMDB Rating                 float64
IMDB Votes                    int64
dtype: object

We can see that the datatype of Release Date has changed to datetime in the updated DataFrame, movie_ratings. Now we can perform computations on Release Date. Suppose we wish to create a new variable Release Year that consists of the year of release of the movie. We’ll use the year attribute of the dt accessor of a datetime Series to extract the year from Release Date:

#Extracting year from Release Date
movie_ratings['Release Year'] = movie_ratings['Release Date'].dt.year
movie_ratings.head()
Title US Gross Worldwide Gross Production Budget Release Date MPAA Rating Source Major Genre Creative Type IMDB Rating IMDB Votes Release Year
0 Opal Dreams 14443 14443 9000000 2006-11-22 PG/PG-13 Adapted screenplay Drama Fiction 6.5 468 2006
1 Major Dundee 14873 14873 3800000 1965-04-07 PG/PG-13 Adapted screenplay Western/Musical Fiction 6.7 2588 1965
2 The Informers 315000 315000 18000000 2009-04-24 R Adapted screenplay Horror/Thriller Fiction 5.2 7595 2009
3 Buffalo Soldiers 353743 353743 15000000 2003-07-25 R Adapted screenplay Comedy Fiction 6.9 13510 2003
4 The Last Sin Eater 388390 388390 2200000 2007-02-09 PG/PG-13 Adapted screenplay Drama Fiction 5.7 1012 2007

As Release Year is a numeric variable, it will appear in the numeric summary statistics produced by the describe() function, as shown below.

movie_ratings.describe()
US Gross Worldwide Gross Production Budget IMDB Rating IMDB Votes Release Year
count 2.228000e+03 2.228000e+03 2.228000e+03 2228.000000 2228.000000 2228.000000
mean 5.076370e+07 1.019370e+08 3.816055e+07 6.239004 33585.154847 2002.005386
std 6.643081e+07 1.648589e+08 3.782604e+07 1.243285 47325.651561 5.524324
min 0.000000e+00 8.840000e+02 2.180000e+02 1.400000 18.000000 1953.000000
25% 9.646188e+06 1.320737e+07 1.200000e+07 5.500000 6659.250000 1999.000000
50% 2.838649e+07 4.266892e+07 2.600000e+07 6.400000 18169.000000 2002.000000
75% 6.453140e+07 1.200000e+08 5.300000e+07 7.100000 40092.750000 2006.000000
max 7.601676e+08 2.767891e+09 3.000000e+08 9.200000 519541.000000 2039.000000
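
Other components of the date can be extracted from Release Date in the same way with the dt accessor, for example:

movie_ratings['Release Date'].dt.month #Month of release as an integer (1-12)
movie_ratings['Release Date'].dt.day_name() #Day of the week of release, e.g., 'Friday'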

3.2.8 Practice exercise 2

3.2.8.1

Why is Worldwide Sales not included in the summary statistics table printed in Practice exercise 1?

album_data.dtypes
Year                 int64
Ranking              int64
Artist              object
Album               object
Worldwide Sales     object
CDs                  int64
Tracks               int64
Album Length        object
Hours              float64
Minutes            float64
Seconds              int64
Genre               object
dtype: object

Worldwide Sales is not included in the summary statistics table printed in Practice exercise 1 because its datatype is object, not int or float.

3.2.8.2

Update the DataFrame so that Worldwide Sales is included in the summary statistics table. Print the summary statistics table.

Hint: Sometimes it may not be possible to convert an object to a numeric value. For example, the string ‘hi’ cannot be converted to a number. To avoid getting an error, use the errors argument of to_numeric() to force such conversions to NaN (missing value).

album_data['Worldwide Sales'] = pd.to_numeric(album_data['Worldwide Sales'], errors = 'coerce')
album_data.describe()
Year Ranking Worldwide Sales CDs Tracks Hours Minutes Seconds
count 320.000000 320.00000 3.190000e+02 320.000000 320.000000 320.000000 320.000000 320.000000
mean 2005.500000 5.50000 1.071093e+07 1.043750 14.306250 0.941406 56.478500 3388.715625
std 9.247553 2.87678 7.566796e+06 0.246528 5.868995 0.382895 22.970109 1378.209812
min 1990.000000 1.00000 1.909009e+06 1.000000 6.000000 0.320000 19.430000 1166.000000
25% 1997.750000 3.00000 5.000000e+06 1.000000 12.000000 0.740000 44.137500 2648.250000
50% 2005.500000 5.50000 8.255866e+06 1.000000 13.000000 0.860000 51.555000 3093.500000
75% 2013.250000 8.00000 1.400000e+07 1.000000 15.000000 1.090000 65.112500 3906.750000
max 2021.000000 10.00000 4.500000e+07 4.000000 67.000000 5.070000 304.030000 18242.000000

3.2.8.3

Create a new column that computes the average worldwide sales per year for each album, assuming that the worldwide sales are as of 2022. Print the first 5 rows of the updated DataFrame.

album_data['mean_sales_per_year'] = album_data['Worldwide Sales']/(2022-album_data['Year'])
album_data.head()
Year Ranking Artist Album Worldwide Sales CDs Tracks Album Length Hours Minutes Seconds Genre mean_sales_per_year
0 1990 8 Phil Collins Serious Hits... Live! 9956520.0 1 15 1:16:53 1.28 76.88 4613 Rock 311141.25
1 1990 1 Madonna The Immaculate Collection 30000000.0 1 17 1:13:32 1.23 73.53 4412 Pop 937500.00
2 1990 10 The Three Tenors Carreras Domingo Pavarotti In Concert 1990 8533000.0 1 17 1:07:55 1.13 67.92 4075 Classical 266656.25
3 1990 4 MC Hammer Please Hammer Don't Hurt Em 18000000.0 1 13 0:59:04 0.98 59.07 3544 Hip Hop 562500.00
4 1990 6 Movie Soundtrack Aashiqui 15000000.0 1 12 0:58:13 0.97 58.22 3493 World 468750.00

3.2.9 Reading a sub-set of data: loc and iloc

Sometimes we may be interested in working with a subset of rows and columns of the data, instead of the entire dataset. The indexing operators loc and iloc provide a convenient way of selecting the desired rows and columns. The operator loc uses axis labels (row indices and column names) to subset the data, while iloc uses the integer positions of rows and columns, which run 0, 1, 2, … from top to bottom for rows and from left to right for columns. In other words, the first row has position 0, the second row has position 1, and so on; similarly, the first column from the left has position 0, the second column has position 1, and so on.

Let us read the file movie_IMDBratings_sorted.csv, which has movies sorted in the descending order of their IMDB ratings.

movies_sorted = pd.read_csv('./Datasets/movie_IMDBratings_sorted.csv',index_col = 0)

The argument index_col=0 assigns the first column of the file as the row indices of the DataFrame.

movies_sorted.head()
Title US Gross Worldwide Gross Production Budget Release Date MPAA Rating Source Major Genre Creative Type IMDB Rating IMDB Votes
Rank
1 The Shawshank Redemption 28241469 28241469 25000000 Sep 23 1994 R Adapted screenplay Drama Fiction 9.2 519541
2 Inception 285630280 753830280 160000000 Jul 16 2010 PG/PG-13 Original Screenplay Horror/Thriller Fiction 9.1 188247
3 The Dark Knight 533345358 1022345358 185000000 Jul 18 2008 PG/PG-13 Adapted screenplay Action/Adventure Fiction 8.9 465000
4 Schindler's List 96067179 321200000 25000000 Dec 15 1993 R Adapted screenplay Drama Non-Fiction 8.9 276283
5 Pulp Fiction 107928762 212928762 8000000 Oct 14 1994 R Original Screenplay Drama Fiction 8.9 417703

Let us say we wish to subset the title, worldwide gross, production budget, and IMDB rating of the top 3 movies.

# Subsetting the DataFrame by loc - using axis labels
movies_subset = movies_sorted.loc[1:3,['Title','Worldwide Gross','Production Budget','IMDB Rating']]
movies_subset
Title Worldwide Gross Production Budget IMDB Rating
Rank
1 The Shawshank Redemption 28241469 25000000 9.2
2 Inception 753830280 160000000 9.1
3 The Dark Knight 1022345358 185000000 8.9
# Subsetting the DataFrame by iloc - using the integer positions of rows and columns
movies_subset = movies_sorted.iloc[0:3,[0,2,3,9]]
movies_subset
Title Worldwide Gross Production Budget IMDB Rating
Rank
1 The Shawshank Redemption 28241469 25000000 9.2
2 Inception 753830280 160000000 9.1
3 The Dark Knight 1022345358 185000000 8.9

Let us find the movie with the maximum Worldwide Gross.

We will use the argmax() function of the Pandas Series class to find the position of the movie with the maximum worldwide gross, and then use the position to find the movie.

position_max_wgross = movies_sorted['Worldwide Gross'].argmax()
movies_sorted.iloc[position_max_wgross,:]
Title                             Avatar
US Gross                       760167650
Worldwide Gross               2767891499
Production Budget              237000000
Release Date                 Dec 18 2009
MPAA Rating                     PG/PG-13
Source               Original Screenplay
Major Genre             Action/Adventure
Creative Type                    Fiction
IMDB Rating                          8.3
IMDB Votes                        261439
Name: 59, dtype: object

Avatar has the highest worldwide gross of all the movies. Note that the : indicates that all the columns of the DataFrame are selected.
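
An equivalent sketch uses the idxmax() function, which returns the axis label (rather than the position) of the maximum, and therefore pairs with loc instead of iloc:

label_max_wgross = movies_sorted['Worldwide Gross'].idxmax() #Row label of the movie with the maximum worldwide gross
movies_sorted.loc[label_max_wgross,:]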

Key differences between loc and iloc in Pandas

  • Indexing Type:

    • loc uses labels (names) for indexing.
    • iloc uses integer positions for indexing.
  • Inclusion of Endpoints:

    • In a loc slice, both endpoints are included.
    • In an iloc slice, the last endpoint is excluded.

Examples: Assuming you have a DataFrame like this:

data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50],
        'C': [100, 200, 300, 400, 500]}

df = pd.DataFrame(data, index=['row1', 'row2', 'row3', 'row4', 'row5'])
df
A B C
row1 1 10 100
row2 2 20 200
row3 3 30 300
row4 4 40 400
row5 5 50 500
# using 'loc'
df.loc['row2':'row4', 'B']
row2    20
row3    30
row4    40
Name: B, dtype: int64
# using 'iloc'
df.iloc[1:4, 1]
row2    20
row3    30
row4    40
Name: B, dtype: int64

Note that in the loc example, both ‘row2’ and ‘row4’ are included in the result, whereas in the iloc example, the rows at integer positions 1, 2 and 3 are included, but the row at position 4 is excluded.

3.2.10 Practice exercise 3

3.2.10.1

Find the album having the highest worldwide sales per year, and its artist.

album_data.iloc[album_data['mean_sales_per_year'].argmax(),:]
Year                        2021
Ranking                        1
Artist                     Adele
Album                         30
Worldwide Sales        4485025.0
CDs                            1
Tracks                        12
Album Length             0:58:14
Hours                       0.97
Minutes                    58.23
Seconds                     3494
Genre                        Pop
mean_sales_per_year    4485025.0
Name: 312, dtype: object

‘30’ has the highest worldwide sales per year, and its artist is Adele.

3.2.10.2

Subset the data to include only Hip-Hop albums. How many Hip-Hop albums are there?

hiphop_albums = album_data.loc[album_data['Genre']=='Hip Hop',:].copy() #.copy() avoids a SettingWithCopyWarning when a column is added in the next exercise
print("There are",hiphop_albums.shape[0], "hip-hop albums")
There are 42 hip-hop albums

3.2.10.3

Which album amongst the hip-hop albums has the highest mean sales per year per track, and who is its artist?

hiphop_albums.loc[:,'mean_sales_per_year_track'] = hiphop_albums.loc[:,'Worldwide Sales']/((2022-hiphop_albums.loc[:,'Year'])*(hiphop_albums.loc[:,'Tracks']))
hiphop_albums.iloc[hiphop_albums['mean_sales_per_year_track'].argmax(),:]
Year                                  2021
Ranking                                  6
Artist                           Cai Xukun
Album                                    迷
Worldwide Sales                  3402981.0
CDs                                      1
Tracks                                  11
Album Length                       0:24:16
Hours                                  0.4
Minutes                              24.27
Seconds                               1456
Genre                              Hip Hop
mean_sales_per_year              3402981.0
mean_sales_per_year_track    309361.909091
Name: 318, dtype: object

迷 has the highest mean sales per year per track amongst hip-hop albums, and its artist is Cai Xukun.

3.3 Reading other data formats - txt, html, json

Although csv is a very popular format for structured data, data is found in several other formats as well. Some of the other data formats are txt, html and json.

3.3.1 Reading txt files

The txt format offers some additional flexibility as compared to the csv format. In the csv format, the delimiter is a comma (or the column values are separated by a comma). However, in a txt file, the delimiter can be anything as desired by the user. Let us read the file movie_ratings.txt, where the variable values are separated by a tab character.

movie_ratings_txt = pd.read_csv('movie_ratings.txt',sep='\t')

We use the function read_csv to read a txt file. However, we mention the tab character ('\t') as the separator of variable values.

Note that there is no need to remember the argument name - sep for specifying the delimiter. You can always refer to the read_csv() documentation to find the relevant argument.

3.3.2 Practice exercise 4

Read the file bestseller_books.txt. It contains the top 50 best-selling books on Amazon from 2009 to 2019. Identify the delimiter without opening the file with Notepad or other text-editing software. How many rows and columns are there in the dataset?

Solution:

#Reading some lines with 'error_bad_lines=False' to identify the delimiter
#Note: 'error_bad_lines' was removed in Pandas 2.0; use on_bad_lines='skip' instead
bestseller_books = pd.read_csv('./Datasets/bestseller_books.txt',error_bad_lines=False)
bestseller_books.head()
b'Skipping line 6: expected 1 fields, saw 2\nSkipping line 10: expected 1 fields, saw 3\nSkipping line 16: expected 1 fields, saw 5\nSkipping line 17: expected 1 fields, saw 4\nSkipping line 20: expected 1 fields, saw 3\n [... similar warnings for about a hundred more lines ...] \nSkipping line 533: expected 1 fields, saw 3\nSkipping line 534: expected 1 fields, saw 3\n'
;Unnamed: 0;Name;Author;User Rating;Reviews;Price;Year;Genre
0 0;0;10-Day Green Smoothie Cleanse;JJ Smith;4.7...
1 1;1;11/22/63: A Novel;Stephen King;4.6;2052;22...
2 2;2;12 Rules for Life: An Antidote to Chaos;Jo...
3 3;3;1984 (Signet Classics);George Orwell;4.7;2...
4 5;5;A Dance with Dragons (A Song of Ice and Fi...
#The delimiter seems to be ';' based on the output of the above code
bestseller_books = pd.read_csv('./Datasets/bestseller_books.txt',sep=';')
bestseller_books.head()
Unnamed: 0 Unnamed: 0.1 Name Author User Rating Reviews Price Year Genre
0 0 0 10-Day Green Smoothie Cleanse JJ Smith 4.7 17350 8 2016 Non Fiction
1 1 1 11/22/63: A Novel Stephen King 4.6 2052 22 2011 Fiction
2 2 2 12 Rules for Life: An Antidote to Chaos Jordan B. Peterson 4.7 18979 15 2018 Non Fiction
3 3 3 1984 (Signet Classics) George Orwell 4.7 21424 6 2017 Fiction
4 4 4 5,000 Awesome Facts (About Everything!) (Natio... National Geographic Kids 4.8 7665 12 2019 Non Fiction
#The file read with ';' as the delimiter is correct
print("The file has",bestseller_books.shape[0],"rows and",bestseller_books.shape[1],"columns")
The file has 550 rows and 9 columns

Alternatively, you can use the argument sep = None, and engine = 'python'. The default engine is C. However, the ‘python’ engine has a ‘sniffer’ tool which may identify the delimiter automatically.

bestseller_books = pd.read_csv('./Datasets/bestseller_books.txt',sep=None, engine = 'python')
bestseller_books.head()
Unnamed: 0 Unnamed: 0.1 Name Author User Rating Reviews Price Year Genre
0 0 0 10-Day Green Smoothie Cleanse JJ Smith 4.7 17350 8 2016 Non Fiction
1 1 1 11/22/63: A Novel Stephen King 4.6 2052 22 2011 Fiction
2 2 2 12 Rules for Life: An Antidote to Chaos Jordan B. Peterson 4.7 18979 15 2018 Non Fiction
3 3 3 1984 (Signet Classics) George Orwell 4.7 21424 6 2017 Fiction
4 4 4 5,000 Awesome Facts (About Everything!) (Natio... National Geographic Kids 4.8 7665 12 2019 Non Fiction

3.3.3 Reading HTML data

The Pandas function read_html searches for tabular data, i.e., data contained within the <table> tags of an html file. Let us read the tables in the GDP per capita page on Wikipedia.

#Reading all the tables from the Wikipedia page on GDP per capita
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita')

All the tables will be read and stored in the variable named tables. Let us find the datatype of the variable tables.

#Finding the datatype of the variable - tables
type(tables)
list

The variable - tables is a list of all the tables read from the HTML data.

#Number of tables read from the page
len(tables)
6

The built-in function len can be used to find the length of the list - tables, i.e., the number of tables read from the Wikipedia page. Let us check out the first table.

#Checking out the first table. Note that the index of the first table will be 0.
tables[0]
0 1 2
0 .mw-parser-output .legend{page-break-inside:av... $20,000 - $30,000 $10,000 - $20,000 $5,000 - $... $1,000 - $2,500 $500 - $1,000 <$500 No data

The above table doesn’t seem to be useful. Let us check out the second table.

#Checking out the second table. Note that the index of the first table will be 1.
tables[1]
Country/Territory UN Region IMF[4][5] United Nations[6] World Bank[7]
Country/Territory UN Region Estimate Year Estimate Year Estimate Year
0 Liechtenstein * Europe 180227 2020 169049 2019
1 Monaco * Europe 173696 2020 173688 2020
2 Luxembourg * Europe 135046 2022 117182 2020 135683 2021
3 Bermuda * Americas 123945 2020 110870 2021
4 Ireland * Europe 101509 2022 86251 2020 85268 2020
... ... ... ... ... ... ... ... ...
212 Central AfricanRepublic * Africa 527 2022 481 2020 477 2020
213 Sierra Leone * Africa 513 2022 475 2020 485 2020
214 Madagascar * Africa 504 2022 470 2020 496 2020
215 South Sudan * Africa 393 2022 1421 2020 1120 2015
216 Burundi * Africa 272 2022 286 2020 274 2020

217 rows × 8 columns

The above table contains the estimated GDP per capita of all countries. This is the table that is likely to be relevant to a user interested in analyzing the GDP per capita of countries. Instead of reading all the tables of an HTML file, we can restrict the search to tables containing certain relevant keywords. Let us try searching for all tables containing the word ‘Country’.

#Reading all the tables from the Wikipedia page on GDP per capita, containing the word 'Country'
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita', match = 'Country')

The match argument can be used to specify the keywords to be present in the table to be read.

len(tables)
1

Only one table contains the keyword - ‘Country’. Let us check out the table obtained.

#Table having the keyword - 'Country' from the HTML page
tables[0]
Country/Territory UN Region IMF[4][5] United Nations[6] World Bank[7]
Country/Territory UN Region Estimate Year Estimate Year Estimate Year
0 Liechtenstein * Europe 180227 2020 169049 2019
1 Monaco * Europe 173696 2020 173688 2020
2 Luxembourg * Europe 135046 2022 117182 2020 135683 2021
3 Bermuda * Americas 123945 2020 110870 2021
4 Ireland * Europe 101509 2022 86251 2020 85268 2020
... ... ... ... ... ... ... ... ...
212 Central AfricanRepublic * Africa 527 2022 481 2020 477 2020
213 Sierra Leone * Africa 513 2022 475 2020 485 2020
214 Madagascar * Africa 504 2022 470 2020 496 2020
215 South Sudan * Africa 393 2022 1421 2020 1120 2015
216 Burundi * Africa 272 2022 286 2020 274 2020

217 rows × 8 columns

The argument match helps with a more focused search, and helps us discard irrelevant tables.

3.3.4 Practice exercise 5

Read the table(s) consisting of the attendance of spectators at the FIFA World Cup from this page. Read only those table(s) that have the word ‘attendance’ in them. How many rows and columns are there in the table(s)?

dfs = pd.read_html('https://en.wikipedia.org/wiki/FIFA_World_Cup',
                       match='attendance')
print(len(dfs))
data = dfs[0]
print("Number of rows =",data.shape[0], "and number of columns=",data.shape[1])
1
Number of rows = 22 and number of columns= 9

3.3.5 Reading JSON data

JSON stands for JavaScript Object Notation, in which the data is stored and transmitted as plain text. A couple of benefits of the JSON format are:

  1. Since the format is text only, JSON data can easily be exchanged between web applications, and used by any programming language.

  2. Unlike the csv format, JSON supports a hierarchical data structure, and is easier to integrate with APIs.

The JSON format can support a hierarchical data structure, as it is built on the following two data structures (Source: technical documentation):

  • A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
  • An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

These are universal data structures. Virtually all modern programming languages support them in one form or another. It makes sense that a data format that is interchangeable with programming languages also be based on these structures.

The Pandas function read_json converts a JSON string to a Pandas DataFrame. The function dumps() of the json library converts a Python object to a JSON string.
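
As a minimal sketch of both directions (the movie object below is made up for illustration):

import json
from io import StringIO

movie = {'Title': 'Inception', 'Ratings': {'IMDB': 9.1, 'Rotten Tomatoes': 87}} #A hierarchical Python object
json_string = json.dumps(movie) #Python object -> JSON string
print(json_string)

#JSON string -> DataFrame; a list of records becomes one row per record
pd.read_json(StringIO('[{"Title": "Inception", "IMDB Rating": 9.1}]'))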

Let us read the JSON data on TED Talks.

tedtalks_data = pd.read_json('https://raw.githubusercontent.com/cwkenwaysun/TEDmap/master/data/TED_Talks.json')
tedtalks_data.head()
id speaker headline URL description transcript_URL month_filmed year_filmed event duration date_published tags newURL date views rates
0 7 David Pogue Simplicity sells http://www.ted.com/talks/view/id/7 New York Times columnist David Pogue takes aim... http://www.ted.com/talks/view/id/7/transcript?... 2 2006 TED2006 0:21:26 6/27/06 simplicity,computers,software,interface design... https://www.ted.com/talks/david_pogue_says_sim... 2006-06-27 1646773 [{'id': 7, 'name': 'Funny', 'count': 968}, {'i...
1 6 Craig Venter Sampling the ocean's DNA http://www.ted.com/talks/view/id/6 Genomics pioneer Craig Venter takes a break fr... http://www.ted.com/talks/view/id/6/transcript?... 7 2005 TEDGlobal 2005 0:16:51 2004/05/07 biotech,invention,oceans,genetics,DNA,biology,... https://www.ted.com/talks/craig_venter_on_dna_... 2004-05-07 562625 [{'id': 3, 'name': 'Courageous', 'count': 21},...
2 4 Burt Rutan The real future of space exploration http://www.ted.com/talks/view/id/4 In this passionate talk, legendary spacecraft ... http://www.ted.com/talks/view/id/4/transcript?... 2 2006 TED2006 0:19:37 10/25/06 aircraft,flight,industrial design,NASA,rocket ... https://www.ted.com/talks/burt_rutan_sees_the_... 2006-10-25 2046869 [{'id': 3, 'name': 'Courageous', 'count': 169}...
3 3 Ashraf Ghani How to rebuild a broken state http://www.ted.com/talks/view/id/3 Ashraf Ghani's passionate and powerful 10-minu... http://www.ted.com/talks/view/id/3/transcript?... 7 2005 TEDGlobal 2005 0:18:45 10/18/06 corruption,poverty,economics,investment,milita... https://www.ted.com/talks/ashraf_ghani_on_rebu... 2006-10-18 814554 [{'id': 3, 'name': 'Courageous', 'count': 140}...
4 5 Chris Bangle Great cars are great art http://www.ted.com/talks/view/id/5 American designer Chris Bangle explains his ph... http://www.ted.com/talks/view/id/5/transcript?... 2 2002 TED2002 0:20:04 2004/05/07 cars,industrial design,transportation,inventio... https://www.ted.com/talks/chris_bangle_says_gr... 2004-05-07 870950 [{'id': 1, 'name': 'Beautiful', 'count': 89}, ...

3.3.6 Practice exercise 6

Read the movies dataset from here. How many rows and columns are there in the data?

movies_data = pd.read_json('https://raw.githubusercontent.com/vega/vega-datasets/master/data/movies.json')
print("Number of rows =",movies_data.shape[0], "and number of columns=",movies_data.shape[1])
Number of rows = 3201 and number of columns= 16

3.3.7 Reading data from web APIs

API, an acronym for Application Programming Interface, is a way of transferring information between systems. Many websites have public APIs that provide data via JSON or other formats. For example, the IMDb-API is a web service for receiving movie, serial, and cast information. API results are in the JSON format and include items such as movie specifications, ratings, Wikipedia page content, etc. One of these APIs contains ratings of the top 250 movies on IMDB. Let us read this data using the IMDB API.

We’ll use the get function from the Python library requests to request data from the API and obtain a response code. The response code will let us know if our request to pull data from this API was successful.

#Importing the requests library
import requests as rq
# Downloading imdb top 250 movie's data
url = 'https://imdb-api.com/en/API/Top250Movies/k_v6gf8ppf' #URL of the API containing top 250 movies based on IMDB ratings
response = rq.get(url) #Requesting data from the API
response
<Response [200]>

We have a response code of 200, which indicates that the request was successful.
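
In a script, the status code can be checked programmatically before parsing the response; a sketch:

if response.status_code == 200: #200 indicates a successful request
    movie_data = response.json()
else:
    print('Request failed with status code', response.status_code)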

The response object’s JSON method will return a dictionary containing JSON parsed into native Python objects.

movie_data = response.json()
movie_data.keys()
dict_keys(['items', 'errorMessage'])

movie_data contains only two keys. The items key seems likely to contain the information about the top 250 movies. Let us convert the value of the items key (which is a list of dictionaries) to a DataFrame, so that we can view it in tabular form.

#Converting a list of dictionaries to a dataframe
movie_data_df = pd.DataFrame(movie_data['items'])
#Checking the movie data pulled using the API
movie_data_df.head()
id rank title fullTitle year image crew imDbRating imDbRatingCount
0 tt0111161 1 The Shawshank Redemption The Shawshank Redemption (1994) 1994 https://m.media-amazon.com/images/M/MV5BMDFkYT... Frank Darabont (dir.), Tim Robbins, Morgan Fre... 9.2 2624065
1 tt0068646 2 The Godfather The Godfather (1972) 1972 https://m.media-amazon.com/images/M/MV5BM2MyNj... Francis Ford Coppola (dir.), Marlon Brando, Al... 9.2 1817542
2 tt0468569 3 The Dark Knight The Dark Knight (2008) 2008 https://m.media-amazon.com/images/M/MV5BMTMxNT... Christopher Nolan (dir.), Christian Bale, Heat... 9.0 2595637
3 tt0071562 4 The Godfather Part II The Godfather Part II (1974) 1974 https://m.media-amazon.com/images/M/MV5BMWMwMG... Francis Ford Coppola (dir.), Al Pacino, Robert... 9.0 1248050
4 tt0050083 5 12 Angry Men 12 Angry Men (1957) 1957 https://m.media-amazon.com/images/M/MV5BMWU4N2... Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb 8.9 775140
#Rows and columns of the movie data
movie_data_df.shape
(250, 9)

This API provides the names of the top 250 movies along with the year of release, IMDB ratings, and cast information.

3.4 Writing data

The Pandas function to_csv can be used to write (or export) data to a csv or txt file. Below are some examples.

Example 1: Let us export the movies data of the top 250 movies to a csv file.

#Exporting the data of the top 250 movies to a csv file
movie_data_df.to_csv('movie_data_exported.csv')

The file movie_data_exported.csv will appear in the working directory.
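
Note that to_csv writes the row indices as the first column by default. If the indices carry no information (e.g., the default 0, 1, 2, …), they can be omitted with the index argument:

#Exporting the data without the row indices
movie_data_df.to_csv('movie_data_exported.csv', index = False)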

Example 2: Let us export the movies data of the top 250 movies to a txt file with a semi-colon as the delimiter.

movie_data_df.to_csv('movie_data_exported.txt',sep=';')

Example 3: Let us export the movies data of the top 250 movies to a JSON file.

import json #The json library is needed for the dump() function
with open('movie_data.json', 'w') as f:
    json.dump(movie_data, f)